Reconstructing historic and modern potato late blight outbreaks using text analytics

In 1843, a hitherto unknown plant pathogen entered the US and spread to potato fields in the northeast. By 1845, the pathogen had reached Ireland leading to devastating famine. Questions arose immediately about the source of the outbreaks and how the disease should be managed. The pathogen, now known as Phytophthora infestans, still continues to threaten food security globally. A wealth of untapped knowledge exists in both archival and modern documents, but is not readily available because the details are hidden in descriptive text. In this work, we (1) used text analytics of unstructured historical reports (1843–1845) to map US late blight outbreaks; (2) characterized theories on the source of the pathogen and remedies for control; and (3) created modern late blight intensity maps using Twitter feeds. The disease spread from 5 to 17 states and provinces in the US and Canada between 1843 and 1845. Crop losses, Andean sources of the pathogen, possible causes and potential treatments were discussed. Modern disease discussion on Twitter included near-global coverage and local disease observations. Topic modeling revealed general disease information, published research, and outbreak locations. The tools described will help researchers explore and map unstructured text to track and visualize pandemics.


In 1845, Ireland's potato crop was destroyed by a plant disease. At the time, the germ theory of disease was not understood and the cause of the disease was unclear 1. The pathogen destroyed potatoes and plagued Ireland for 7 years, leading to mass starvation of the Irish people, emigration, and the Irish Potato Famine 2. However, the first reports of the disease occurred in the United States in 1843 around the ports of Philadelphia and New York 3,4. The plant disease pandemic spread over a three-year period to more states and provinces in the northeast US and Canada and by 1845 was reported in Europe, the UK and Ireland.
Many theories were proposed at the time to explain the pandemic, including a curse from God, bad weather, the laziness of the Irish, or a minute fungus [5][6][7]. M. J. Berkeley would ultimately elucidate the cause and named the fungus-like pathogen Botrytis infestans. Later, Anton DeBary would publish the complete life cycle and rename the pathogen Phytophthora infestans 8,9. Initially, South America was suspected to be the center of origin of the disease and the source of the nineteenth century outbreaks 5,6. Later, competing theories suggested either a Mexican source [10][11][12][13] or an Andean source [14][15][16][17]. A hybrid theory, with Mexico as the center of origin of the disease but the Andean region as the source of the outbreak strain, has also been proposed by several authors 3,6,18. Historic herbarium specimens have been used to identify and track the outbreak strain [19][20][21][22][23][24] and genomic sequence data support an Andean source 22.
The presence of the disease was recorded extensively in nineteenth century newspapers, letters, and government reports, many of which were documented in the annual US Commissioner of Patent Reports [25][26][27]. Prior to the formation of the US Department of Agriculture, agricultural reports including crop status and yield information were collected in the Patent Office reports. These documents included descriptions of location information as growers reported symptoms on their farms and chronicled their attempts at controlling the disease. The content also included excerpts from agricultural newspapers. These reports are a valuable resource for understanding the source of this important plant disease during the first outbreaks of the pandemic in the US, prior to its spread to Europe and Ireland. Previously published maps have only inferred the general location and direction of spread of the disease over time at a broad scale, based on manual examination of the texts [4][5][6].
Given the narrative nature of the patent reports and the sheer volume of information (over 6000 pages between 1841 and 1850), we used data analytics and natural language processing (NLP) to efficiently extract

Impact of late blight in the US
We conducted an analysis of the frequency of key words and phrases relating to potato blight in the US Annual Commissioner of Patent Reports [25][26][27]. Lexical dispersion plots of data retrieved from the 1843 to 1845 reports revealed an increase in the frequency of words like "rot", "blight", and "disease" (Fig. 1). These keywords occurred in the text in close proximity to the keyword "potato", starting in 1844, suggesting an increase in the occurrence of the disease from the previous year. The frequency change also implies an increase in discussion about the disease in the reports as growers and scientists became aware of the new disease affecting potato crops and discussed the disease at local agricultural society meetings.
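The keyword analysis described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the sample sentence, and the window size are assumptions made for illustration, and the tokenizer is a deliberately simple regular expression that treats "potatoes" and "potato" as different words.

```python
import re

# Disease keywords named in the text; the anchor keyword is "potato".
DISEASE_TERMS = {"rot", "blight", "disease"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def dispersion_offsets(tokens, term):
    """Token offsets at which `term` occurs (the x-values of a dispersion plot)."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def cooccurrences(tokens, anchor="potato", terms=DISEASE_TERMS, window=10):
    """Count disease terms found within `window` tokens of each anchor occurrence.

    A term that lies within the window of two anchors is counted twice.
    """
    counts = {term: 0 for term in terms}
    for i, tok in enumerate(tokens):
        if tok != anchor:
            continue
        lo, hi = max(0, i - window), i + window + 1
        for neighbor in tokens[lo:hi]:
            if neighbor in counts:
                counts[neighbor] += 1
    return counts

# Hypothetical sample text, not a quote from the reports.
sample = ("The potato crop was struck by rot. Wheat was fine. "
          "The disease spread to every potato field.")
tokens = tokenize(sample)
print(dispersion_offsets(tokens, "potato"))
print(cooccurrences(tokens, window=5))
```

Running the same counts separately on each year's report would give the kind of year-over-year comparison the paragraph describes.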
We also examined reports of the potato disease with Google Ngram, which provided a broader look at literature from the time period. A Google Ngram search of the Google Books Corpus for three 2-grams ("potato disease", "potato rot", "late blight") revealed a spike in the occurrence of "potato disease" and "potato rot" around the mid-1840s that coincided with the Irish Famine outbreaks (Fig. 2). A second spike for "potato disease" was noted around the 1870s, which correlated with a second major outbreak of late blight that was recorded in historic literature 38. A third spike, almost as large as the 1840s spike, occurred in the 1940s during World War II. The term "late blight" did not appear in the corpus until around 1900, possibly in conjunction with both the observation of disease occurrence in late maturing varieties and the start of breeding efforts to produce blight-resistant potatoes that had delayed disease until "late" in the season.
Through keyword searches, we located sections of interest within the patent reports, which could then be examined in relation to theories on cause and source of the disease and means of control (Table 1). The letters and reports reveal the concern and alarm within the US agricultural community in 1844 and 1845 as it became clear that the disease observed in 1843 was not an isolated incident but increased in spread and occurrence with time and became a pandemic (Fig. 3). Terms used to describe the disease included "evil", "murrain" and "plague", indicating the serious nature of the problem (Fig. 1). The Massachusetts Ploughman agricultural newspaper is documented in the 1844 report as stating "Have you found out the cause of the strange rot in the potato crop? This is the anxious inquiry of almost every farmer we converse with." Numerous accounts of the struggles of scientists and farmers to understand the cause of the disease and potential treatments were also identified, chronicling the debate over the major theories of the cause of the disease that included insects, poor nutrition, bad weather, worn out varieties and seed of potato, or a minute fungus. The work of J. E. Teschemacher, an amateur botanist, is reported in detail in the 1844 reports, including the symptoms of the disease on potato, microscopic observations and transmission of the "fungus" (Table 1) 26. Teschemacher subsequently published this report in the Gardeners Chronicles 39.
Keyword searches also revealed text passages in which farmers were actively sharing information and developing methods to control the disease, including the use of lime, ash, salt, bluestone copper treatment of tubers, tobacco juice to ward off insects, and application of sulfur (Table 1). It was often mentioned that manured land had greater disease than non-manured land. Since tubers were rotting in the ground, many mentions of methods to store potatoes and keep them dry over the winter are recorded in the reports. The text analytics also documented how the disease affected different varieties of potatoes. A writer to the Bangor Whig newspaper indicated in the 1844 report that "…the Chenango variety, which has been cultivated by farmers for a long series of years, has been the most affected by the disease while The English whites and long reds have not suffered so much".
Examination of place names identified in close proximity to disease keywords also revealed potential sources of introduction of the disease, based on local accounts. A writer to the Bangor Whig newspaper indicated in the 1844 report that "The farmers of Nova Scotia, who shipped large quantities of this root to Boston during the past season, have long been acquainted with the disease, and call it the rot" (Table 1). This passage suggests that the pathogen may have been present in North America earlier than 1843 and that the pathogen could have moved into the US via Canadian potatoes (Table 1). There is also evidence suggesting that growers were obtaining new seed potatoes from South America and from Europe, due to a dry rot that was affecting potato tuber quality. The 1845 report 27 mentions that the disease had been present in South America, documenting, based on a letter from Joachim Acosta, that "…the malady is very common on the table land of Bogota, Columbia; that it is destructive in wet seasons, or even every year in damp spots" (Table 1).

Spread of late blight in the US, 1843-1845
We used geographic locations extracted from the patent report text to generate maps of the spread of late blight in the US and southeastern Canada from 1843 to 1845. These maps represent, to our knowledge, the most accurate spatial reconstruction to date of the movement of the pathogen from the first known reports in the US in 1843 (Fig. 3). Five unique locations in different states where the potato disease was present were identified in 1843, with accounts limited to states around New York and surrounding regions, including Pennsylvania, New Jersey, Delaware, and Connecticut (Fig. 3a). In 1844, the number of unique locations identified increased to 107 and the pandemic expanded to include six other states (Ohio, Massachusetts, Rhode Island, Vermont, New Hampshire, and Maine) as well as the Canadian province of Nova Scotia (Fig. 3b). There were fewer new reports in 1845, with only 53 unique locations identified, but the number of states reporting disease continued to expand, including Michigan, Illinois, Indiana, Maryland, and the Canadian province of New Brunswick (Fig. 3c). Over time, reports increased in frequency in the major potato growing regions and the surrounding northeastern US states. Mention is made of disease occurrence near rivers (Delaware, Connecticut rivers), which were often used to transport seed tubers into rural areas. Interestingly, the disease is still present in many of these same northeastern states, as the maps of cumulative data from the USABlight.org reporting system indicate 17,30.

Modern day records of disease recorded from Twitter
We obtained 10 years of recent global Twitter posts mentioning the pathogen, with a total of 41,720 posts in 39 languages (81% English), of which 22,879 had geographic locations from 132 countries (Fig. 4a). Excluding Retweets and posts attributed to a single scholarly literature delivery service (EurekaMag) reduced this number to 11,570 posts, 7615 of which contained geographic location names. The mentioned locations included places from 107 distinct countries (Fig. 4b). Most topics produced with topic modeling described research advances (e.g., "potato disease potatoes resistant farmers crops help crop could new", "effector rxlr host plant potato targets via avr3a immune effectors") and general information about the disease (e.g., "potato Irish caused disease famine million crop cause blight Ireland"). However, a single frequent topic in 6.26% of posts identified potential disease occurrence reports. The topic modeling represented this topic by its 10 most probable words: "tomato potato found confirmed rt county growers disease strain reported". Further targeting this topic, we extracted example posts that included descriptions of late blight presence and the emergence of particular strains at specifically named locations (e.g., "Late blight found on potato in Adams County in central Wisconsin. Testing carried out by UW confirmed the strain was US-23.") and used this topic to train a supervised machine learning classifier. Unique Tweets in all languages were translated to English and 285 were labeled (late blight report, yes/no) by the authors. Text classification identified 1931 relevant posts that include reports of P. infestans.

Table 1. Specific quotes related to the cause, source and means of treating the nineteenth century potato disease.

Subjects | Topics | Quotes
Causes of disease | Worn out varieties | The better opinion in those countries is, that by long propagation from the tuber, without recurring to the natural seed of the plant, it has lost a portion of its vital power, and hence is extremely prone to blight, rust, and to rot
Causes of disease | Fungus | Mr. Teschemacher, of Boston, tried, by microscopic examination, to find out the cause. He discovered in the potato a growth of fungus, which is a plant analogous to the mushroom family. These hung in seeds are invisible to the naked eye; they are readily carried about by winds, and will penetrate wherever air will. Being once introduced from Europe, their extensive dissemination here is very easy. These seeds falling on the potato, in favorable circumstances as to moisture, & etc., cause the disease
Causes of disease | Weather | Several theories have formerly prevailed on this subject; but, so far as we are acquainted, the most generally received one at the present time is, that a peculiar state of the atmosphere occasions the bursting of the sap vessels of the plant, by which the sap is exuded to the outside of the stalk and leaf. The sap, then, becoming acrid, together with the derangements in the functions of the plant, is supposed to occasion the blight or rust in grain
Crop loss estimates | 33-50% or greater in some states | The loss last season from rot exceeded $700,000, and we fear that this year it will fall but little below $1,000,000. In no case have we heard of an increased crop, but the language, as applied to different sections, is "nearly 50 per cent less, owing to a rot which seized them before the time for taking them out of the ground"
Methods of control | Lime, drying tubers in sand, ash | In 1843, after being put into the cellar; and so he picked out those that were affected, and put half a peck of soaked lime to each layer of the others, and they kept well. In planting, last spring, he put a table spoonful of lime in each hill; and after they were up, and before hilling, he applied to each hill about a will of a mixture of lime 2 bushels, plaster 3, and ashes 8. He had not one rotten potato
Methods of control | Salt | In Pennsylvania, it (rot) prevailed in the moist mountainous region. A friend of his used about a half teaspoon of salt to a hill of potatoes when they began to set to prevent injury from worms, and they were excellent where salt was used, the others not worth harvesting
Methods of control | Manure | Towards the southern boundary, also, the report is: "Greatly injured by the rot; all of one-third of the crop was destroyed. They suffered most on manured lands; on lands not manured, they escaped the disease entirely."
Geography | | In the central western part of the state, on the Connecticut river, the potatoes were early struck with rust, and nearly one-third of the crop has rotted. … for 1844, already given, that the potato rot or disease (as it is called) has extended and prevailed far more than in the year previous. So great has been the evil, that it excites serious apprehensions, unless something may be found to prove an effectual remedy
Sources of infected seed potatoes | Nova Scotia | The farmers of Nova Scotia, who shipped large quantities of this root to Boston during the past season, have long been acquainted with the disease, and call it the rot. It seems to pervade particular farms; and sometimes appears in the stalk, like rust, long before the potato has arrived at maturity; and, on cutting open the young root, the disease will be found to exhibit itself as black spots throughout the inside of it
Sources of infected seed potatoes | French potatoes | French potatoes, received three weeks before directly from Harve… on 3rd of October, ploughed and all perfectly sound with the exception of pink eyed kidneys and French potatoes, which were entirely rotten
Origin of the disease | Bogota Andean source | The malady is very common on the table land of Bogota; that it is destructive in wet seasons, or even every year in damp spots
Variety differences | Some varieties more susceptible | The Chenango variety, which has been cultivated by farmers for a long series of years, has been most affected by the disease. The English whites and long reds have not suffered so much, because they have less constitutional defects; but these 'have for some years shown strong symptoms of decay in intelligent farmer of our acquaintance corroborates the opinion advanced in the above paragraphs, attributing the disease to the constant replanting of the same seed

The geographic origin of the post is derived from a geolocation when provided by the user (0.05% of late blight posts), but for the remaining posts, we inferred this information from the user's profile or other heuristics at coarser geographic scales (i.e., state, country) 40. However, the text content of posts contained additional, different, and more precise geographic information useful in describing where an outbreak was occurring 41,42. To this end, we applied a geoparsing pipeline using named entity recognition (NER) to extract locations from text classified as reports and using disambiguation to assign coordinates to the extracted locations.
Automatically geolocating place names is complicated by the fact that many place names refer to multiple distinct locations. Disambiguation aims to select the true location being discussed. We considered three ways to impute the context of posts to disambiguate locations: using no additional context, the country of origin as context, or additional locations mentioned nearby in text as context (see detailed approach in Methods). The NER analysis yielded 629 reports that included place name entities. Our disambiguation approach resulted in 575 posts including 602 geolocations at the country scale, with 429 posts mentioning 456 locations identified at the state scale, and 272 posts mentioning 283 locations identified at the city- or county-scale. Manual evaluation of positive reports identified 55 false positive reports (8.7% error rate for classification), 171 mis-identified places (21.1% error rate for geoparsing) and 52 additional missed places in posts. After correction, the United States accounted for 55.9% of places reported in posts, followed by Canada (13.8%), the United Kingdom (4.4%), India (4.2%), Bangladesh (3.5%), Nigeria (2.4%), China (2.0%), Ireland (1.5%), Peru (1.3%) and Uganda (1.1%) (Fig. 5A). In the contiguous United States and Canada, there were 395 validated reports from posts mentioning states and 201 with city- or county-scale locations (Fig. 5B) when cities and counties were aggregated to 60 × 60 km grid cells by centroid location. We aggregated omission and commission errors in the automatically classified data by country and state (Fig. 5A,B, respectively). The locations reported provide a new supplementary dataset of spatiotemporal disease observations.
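The three context options for disambiguation can be sketched with a toy gazetteer. This is an illustration only and not the pipeline used in the study: the `GAZETTEER` entries, their approximate coordinates and population values, the `disambiguate` function, and the fallback to the most populous candidate when no context is available are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical toy gazetteer: (name, state/region, country, lat, lon, population).
# Coordinates and populations are rough illustrative values.
GAZETTEER = [
    ("Adams County", "Wisconsin", "US", 43.97, -89.77, 20_000),
    ("Adams County", "Colorado", "US", 39.87, -104.33, 520_000),
    ("Oxford", "England", "GB", 51.75, -1.26, 160_000),
    ("Oxford", "Massachusetts", "US", 42.12, -71.86, 13_000),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def disambiguate(name, country=None, nearby=None):
    """Pick one gazetteer row for `name`, or None if the name is unknown.

    - no context: most populous candidate (an assumed heuristic)
    - country: restrict to candidates in that country when any exist
    - nearby: (lat, lon) of another place mentioned in the text;
      choose the candidate closest to it
    """
    options = [row for row in GAZETTEER if row[0] == name]
    if country is not None:
        options = [row for row in options if row[2] == country] or options
    if not options:
        return None
    if nearby is not None:
        return min(options, key=lambda r: haversine_km(r[3], r[4], *nearby))
    return max(options, key=lambda r: r[5])

# "Adams County" mentioned alongside Madison, Wisconsin (approx. 43.07, -89.40).
print(disambiguate("Adams County"))
print(disambiguate("Adams County", nearby=(43.07, -89.40)))
```

Without context the sketch picks the larger same-named county in another state, which mirrors the kind of error discussed later for counties coded to large states.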

Discussion
The text analytics of US Commissioner of Patent Office Reports revealed that the potato disease was first reported in the US in 1843 in a five-state area. The pathogen and subsequent disease spread rapidly in potato fields and by 1844 had expanded to the neighboring states in the northeast. In 1845, the first reports of the disease in Europe, Scotland, and Ireland occurred, resulting in mass starvation, death, and the emigration of people from Ireland (the Irish diaspora). Since the fungus-like plant pathogen was unknown to science at the time, the cause and source of the disease as well as remedies for control were widely discussed in agricultural reports, newspapers, and government pamphlets. In this work we have created the first accurate maps of the 1843-1845 outbreaks of potato late blight in the US. Our text analytics has also revealed information on theories of the cause and methods of control. We developed a workflow for digitizing historical documents, extracting information of interest and spatial data, and visualizing them on an interactive map for verification and analysis (Supplemental Fig. 1). The methodology accelerates the process of exploring voluminous unstructured text for the purposes of tracking disease observations over time and space based on key words and phrases. This method can be scaled to trace the movement of historic outbreaks of other diseases where historic text documents are available. We visualized the spread of late blight at a level of detail not previously used, and demonstrated the spread of disease from a
In addition to geographic information, the targeted searches allowed us to locate passages of interest in the agricultural papers and meetings of agricultural societies of the time that described the efforts of the agricultural community, landowners, and amateur scientists to understand the agent that caused the potato disease and manage its consequences 26. The observations on the "fungus" causing the disease, first described in the Patent Office reports by Teschemacher and later by him in The Gardeners Chronicles 39, preceded both the work of Louis Pasteur to define the germ theory of disease 43 and the official identification and naming of the potato blight pathogen by Berkeley 5. These passages give us better insight into the contemporaneous ideas on the fungal nature of disease. Reports widely discuss the potato fungus as a consequence rather than cause of the disease, as the contagious nature of a plant pathogen was yet to be accepted 5,7. Over time, the work of DeBary conclusively documented the life cycle of the pathogen 6,8,9.
Due to the nature of the source material, we also observed mentions of grower anxiety and frustration as they contended with this new threat to their potato crops, summed up succinctly by Dr. Saml L. Dana in a letter from the 1844 Patent Office report 26 in which he stated, "I want light." Terms such as evil, murrain, cholera, and plague were repeatedly used to describe the disease, reflecting some understanding of its contagious nature, but also the hopelessness of not having a clear answer as to the cause or prevention of further disease. Methods and strategies used for disease management and those that showed promise of success were often mentioned. The first potato blight outbreaks in the US and Europe led to the nascent development of chemical control strategies to manage plant diseases. Growers used materials at hand on farms including lime, sulfur, bluestone copper steeps of seed potatoes, and salt, and many experimental trials are described in the reports (Table 1).
It is also clear from the reports that infected potato seed tubers were suspected as a source of the disease. The reports mention seed tubers imported from outside the US, including from France, South America, and Nova Scotia. There was a booming trade in Peruvian bat guano as fertilizer in the 1840s, and shipments of potato tubers on those same steamships, arriving from South America in days rather than weeks, would have enabled the pathogen to survive and enter US ports 3,4,6.
The timing of the emergence of several distinct global potato blight disease pandemics over a 150-year time period was also confirmed through the use of keyword searches in Google Ngram with the larger corpus of source information. We used three 2-grams associated with the disease ("potato rot", "potato disease" and "late blight") to search the broader literature. Google Ngram plots of late blight, potato disease and potato rot mentions in published literature revealed the 1845 peak in reports of the potato disease associated with the European and Irish outbreaks and a smaller peak in the 1870s when a second wave of severe late blight was known to have occurred in Europe 38. A plethora of reports on the cause and prevention were published during that time period. The Ngram search also revealed a peak in late blight mentions in the 1940s. Underlying data revealed that in 1946, Phytophthora infestans, without warning, caused an estimated $40,000,000 loss in tomatoes in the United States, almost a century after the first reports in the US 44. Although the suddenness and extent of damage was severe in the US, the consequences were not as disastrous as the nineteenth century disease in Ireland, where potato was a sole food source for millions.
Contemporary surveillance of plant disease and subsequent deployment of management strategies require up-to-date information about where disease is spreading. The immediacy of Tweets offers an opportunity for abundant, low latency data streams with extensive geographic data coverage, both within the U.S. and globally. Our Twitter data showed global engagement in conversation about late blight, but with the greatest volume of posts originating from the U.S. and Europe where late blight research and monitoring efforts are particularly active 17. Refining posts to geolocated disease reports provided additional details useful for tracking the disease at the state level, and occasionally finer scale data (county, city). Both information extracted from text and locational data provided with Twitter posts offer potential new reports of the disease that might be missed by active surveillance. However, these methods still present challenges for disambiguating distinct locations from same-named places, and manual revision was required to validate and improve the quality of data we produced. To analyze the errors, we evaluated specific cases where geocoding mis-identified geolocations in areas where late blight is not known to be present. For example, several counties in other US states were erroneously coded as same-named counties in large states like Texas when the correct state name was not included as context. Abbreviated place names at lower administrative levels without context (e.g., PA and VA for U.S. states Pennsylvania and Virginia) were geolocated as same-abbreviated countries (e.g., Panamá and Vatican City). In the future, custom-trained machine learning and NLP models could be developed to automatically and efficiently process disease observations from Twitter data sources and to support more concerted disease surveillance efforts like USABlight.org 30 that rely on human reviewers to validate new data before it is mapped.
By leveraging large historic data records and more recent Twitter feeds for data analytics we have gained insight into both the epidemiological and societal impact of the first late blight pandemics in the US and better understand where modern global occurrences of the disease are now frequently reported. Extracting data from both of these voluminous sources required NLP techniques to identify themes of conversations and late blight reports containing geographic data. These sources presented certain distinct challenges (e.g., the spelling errors introduced by the physical condition of the historical documents and the URLs and hashtags in the social media posts required tailored preprocessing approaches) and opportunities (e.g., the prose form of the historical documents provides extensive context for each excerpt, while social media posts include metadata, such as geotags and user profile location, for extra context). Our work provides examples of how to overcome some of these challenges and leverage some of the opportunities. The tools we describe here will be open source and archived for future researchers interested in tracking and using social media and unstructured data as predictive intelligence for pandemic prevention.
and was correctly identified. Visualization on a map allowed subject experts to easily see if a point appeared out of place despite a correct location identification (e.g., "Oxford" referring to Oxford, England instead of Oxford, Massachusetts). To refine the results, we defined these five categories: accept (the information is relevant to the question and correctly located), move (the information is relevant but the location needs to be corrected), uncertain (the information might be relevant but requires further scrutiny by a second expert to confirm), archive (the information may be useful to adjacent questions (e.g., discussion of potential disease treatments) but has no relevant location), and remove (the information is irrelevant to the question). The Web map provided the capability to designate a category with a dropdown menu and add comments about each entry. The curated data (those points marked "accepted" or moved to the correct location and then accepted) were then imported into ArcGIS Pro (https://www.esri.com/en-us/arcgis/products/arcgis-pro/overview) to produce georeferenced maps for late blight reports from 1843 to 1845.
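The five review categories and the selection of curated points can be expressed as a small sketch. The `Review` enum, the `Entry` record, and `curated_points` are hypothetical names chosen for illustration; the actual Web map and export format are not described here, so this only mirrors the selection rule stated above (accepted points, plus moved points at their corrected location).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)

class Review(Enum):
    ACCEPT = "accept"        # relevant and correctly located
    MOVE = "move"            # relevant, location needs correction
    UNCERTAIN = "uncertain"  # needs a second expert
    ARCHIVE = "archive"      # useful elsewhere, no relevant location
    REMOVE = "remove"        # irrelevant

@dataclass
class Entry:
    place: str
    coords: Coords
    review: Review
    corrected: Optional[Coords] = None  # filled in by the reviewer for MOVE

def curated_points(entries: List[Entry]) -> List[Tuple[str, Coords]]:
    """Keep accepted entries, and moved entries at their corrected location."""
    points = []
    for entry in entries:
        if entry.review is Review.ACCEPT:
            points.append((entry.place, entry.coords))
        elif entry.review is Review.MOVE and entry.corrected is not None:
            points.append((entry.place, entry.corrected))
    return points

# Illustrative entries with approximate coordinates.
entries = [
    Entry("Oxford", (51.75, -1.26), Review.MOVE, corrected=(42.12, -71.86)),
    Entry("Boston", (42.36, -71.06), Review.ACCEPT),
    Entry("Salem", (42.52, -70.90), Review.UNCERTAIN),
    Entry("Bangor", (44.80, -68.77), Review.REMOVE),
]
print(curated_points(entries))
```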

Google Ngram search
In addition to searching the Patent Office reports, we used Google Ngram Viewer 50 to search for keywords in the Google Books collection for a broader view of the disease's presence.For the purposes of the search, we focused on three terms: "potato disease", "potato rot", and "late blight".The results were plotted as frequency distributions within texts published in the past 200 years.
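Google Ngram Viewer reports, for each year, how often a phrase occurs relative to all phrases of the same length published that year. The sketch below computes that kind of yearly relative frequency for the three 2-grams on a small local corpus. It does not query Google Books; the `yearly_frequency` function and the `(year, text)` input format are assumptions for illustration.

```python
import re
from collections import Counter

TERMS = ("potato disease", "potato rot", "late blight")

def bigrams(text):
    """Return the list of adjacent word pairs (2-grams) in lowercase text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def yearly_frequency(docs, terms=TERMS):
    """docs: iterable of (year, text).

    Returns {year: {term: share of that year's 2-grams equal to term}}.
    """
    totals = Counter()
    hits = {}
    for year, text in docs:
        grams = bigrams(text)
        totals[year] += len(grams)
        counter = hits.setdefault(year, Counter())
        counter.update(g for g in grams if g in terms)
    return {
        year: {t: (hits[year][t] / totals[year] if totals[year] else 0.0)
               for t in terms}
        for year in hits
    }

# Hypothetical two-document corpus.
docs = [(1845, "the potato rot spread"), (1900, "late blight resistant")]
freq = yearly_frequency(docs)
print(freq[1845]["potato rot"], freq[1900]["late blight"])
```

Plotting each term's share against year gives a frequency curve comparable in shape, though not in scale, to an Ngram Viewer plot.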

Twitter feed analytics
We queried Twitter posts ("Tweets") between 2012 and 2022 from the commercial media aggregation service Brandwatch 40 for regular expressions of all common and scientific names of P. infestans recorded in the EPPO GD (15 names, 7 languages, see Supplementary Table 2).This query returned 41,720 Tweets, of which 20,236 were novel posts (not "Retweets").We relied on the post source location derived by Brandwatch 40 to geolocate posts and mapped post intensity by country.Geolocation was available at the country scale for 22,879 Tweets (10,060 excluding Retweets).We relied on the post date to evaluate post intensity over time.The full results showed a high volume of posts in 2015 (12,126 Tweets, more than twice that of the next highest year, 2021 with 5355 Tweets).The majority of these Tweets (10,501 out of 12,126 Tweets) included mention of "EurekaMag", a scholarly literature delivery service.We excluded these posts to focus our further analyses on the content shared on Twitter by a broader variety of users (11,570 Tweets over all 10 years).We further cleaned this data to remove URLs and eliminate duplicates, resulting in 9034 non-EurekaMag Tweets with unique text content.
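The filtering steps described above (name matching, dropping Retweets and EurekaMag posts, removing URLs, and de-duplicating by text) can be sketched as follows. This is a simplified illustration, not the Brandwatch query: the pattern covers only two of the 15 names, the `text` and `is_retweet` keys are hypothetical field names, and lowercasing before de-duplication is an added assumption.

```python
import re

# Illustrative subset of pathogen names; the study queried 15 names in 7 languages.
NAME_PATTERN = re.compile(r"phytophthora\s+infestans|late\s+blight", re.IGNORECASE)
URL_PATTERN = re.compile(r"https?://\S+")

def filter_posts(posts):
    """Return posts that mention the pathogen, excluding Retweets, EurekaMag
    posts, and posts whose URL-stripped text duplicates an earlier post."""
    kept, seen = [], set()
    for post in posts:
        text = post["text"]
        if not NAME_PATTERN.search(text):
            continue
        if post.get("is_retweet") or "eurekamag" in text.lower():
            continue
        # Normalize for duplicate detection: drop URLs, case, extra whitespace.
        key = " ".join(URL_PATTERN.sub("", text).lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(post)
    return kept

# Hypothetical example posts.
posts = [
    {"text": "Late blight found in Salem County https://t.co/a", "is_retweet": False},
    {"text": "late blight found in Salem County https://t.co/b", "is_retweet": False},
    {"text": "RT Late blight confirmed", "is_retweet": True},
    {"text": "EurekaMag: Phytophthora infestans effectors", "is_retweet": False},
    {"text": "Early blight on tomato", "is_retweet": False},
    {"text": "Phytophthora  infestans genotype reported", "is_retweet": False},
]
print(len(filter_posts(posts)))
```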

Topic modeling with NLTK and LDA
To evaluate Tweet content, we generated two sets of topics derived statistically from word frequencies. We created an NLTK corpus using the text from all Tweets and evaluated 30 collocations (word pairs that appear together more often than expected by chance) 46 . This produced a set of initial topics in posts that included references to the Irish potato famine ("irish potato", "potato famine", "hambruna irlandesa"), potato-growing regions ("rift valley"), other plant, animal and human diseases ("bird flu", "dengue fever", "yellow fever", "lassa fever", "bacterial wilt", "downy mildew", "solani alternaria", "early blight", "ricin toxin"), disease pathways and research topics ("rxlr effectors", "clonal lineages", "resistance genes", "technique accelerates", "accelerates isolation") and general terms pertaining to crop hosts, disease surveillance, and food security ("tomato plants", "solanum tuberosum", "potato variety", "plant pathology", "food security", "lutter contre", "alerta temprana").
We used Latent Dirichlet Allocation (LDA) implemented with the Python package tomotopy to further model topics between documents (here, Tweets) in English (https://pypi.org/project/tomotopy/). LDA generates topics in a way that takes into account the common content found within documents, and each document is assigned a probabilistic distribution across the topics 51 . Each of 35 topics was described by its 10 most relevant words and its sum across document probability distributions (Supplementary Table 3).
We used keyword searches derived from the topic modeling to further evaluate direct observations of late blight in the Twitter feeds. Manual evaluation of text revealed the presence of first-hand reports of the disease, frequently at specified geographic locations (e.g., a county, state, or country). Results for the terms "reported", "confirmed", "found in", "found on", "case of", "spreading" and "identified" included descriptive geographic reports, primarily in locations in the US and Canada (e.g., "New Jersey tomato and potato crops threatened by late blight. 4 new cases reported in Salem County N.J. this week.", "PEI agriculture officials now say late blight has been found in potato field in Freetown area"), as well as country-scale reports (e.g., "#Potato Late blight genotype EU_33_A2 reported in #Nigeria") and several first-hand accounts of the disease (e.g., "Sad tonight about having to cut down all my tomato plants because of spreading late blight. Only 6 ripe fruit!").
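The keyword search described above can be illustrated with a minimal Python sketch. The search terms are those listed in the text; the function names, the case-insensitive whole-word matching, and the use of regular expressions are assumptions made for this example and are not taken from the authors' code.

```python
import re

# Search terms listed in the text above.
REPORT_TERMS = [
    "reported", "confirmed", "found in", "found on",
    "case of", "spreading", "identified",
]

# Assumption: terms are matched as whole words, ignoring case.
REPORT_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in REPORT_TERMS) + r")\b",
    re.IGNORECASE,
)


def matched_terms(tweet):
    """Return the report terms found in a Tweet, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in REPORT_PATTERN.finditer(tweet)]


def candidate_reports(tweets):
    """Keep only Tweets that contain at least one report term."""
    return [tweet for tweet in tweets if REPORT_PATTERN.search(tweet)]
```

A filter of this kind only surfaces candidate posts. As in the text, manual evaluation is still needed to decide whether a matching Tweet is a genuine first-hand report.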

Text classification with Scikit-learn
To automatically identify Tweets that contain first-hand reports of the disease, we produced a hand-labeled dataset of positive and negative cases. We used unique posts in English from our topic modeling analysis and machine-translated the remaining 2499 unique non-English Tweets. Translation was successful for 1750 Tweets, with rates of failure varying from 24 to 32% for Tweets in the top 5 non-English languages appearing in the dataset (Spanish, French, Romanian, German, Dutch). We randomly sampled 200 of the resulting 8665 Tweets and manually labeled each post as positive or negative for including direct observations or reports of late blight presence (termed "direct sighting"). Using the keywords previously identified with LDA, we labeled an additional 85 positive examples.
We used the labeled data set to train and evaluate several machine learning algorithms to automate classification of posts as "direct sightings" of late blight. For all methods, the original Tweet was cleaned to remove "at" tags (@ + a username), the Retweet flag ("RT"), and URLs. We used the Python Scikit-learn implementation
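The cleaning step described above (removing "at" tags, the Retweet flag, and URLs) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the regular expressions, the function name `clean_tweet`, the removal of a colon directly following a username, and the whitespace normalization are all assumptions.

```python
import re

# Assumed patterns; the source does not specify how each element was matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # web links
MENTION_RE = re.compile(r"@\w+:?")             # "@" + username, optional trailing colon
RETWEET_RE = re.compile(r"\bRT\b")             # Retweet flag as a standalone token
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip URLs, "at" tags and the Retweet flag, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

URLs are removed first so that an "@" inside a link is not treated as a username.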

Saffer 1 , Laura Tateosian 1 , Amanda C. Saville 2 , Yi-Peng Yang 1 & Jean B. Ristaino 2,3*

Figure 1. Lexical dispersion plots of word frequencies for the indicated potato disease search terms in the US Commissioner of Patent Reports from (a) 1843, (b) 1844 and (c) 1845.

Figure 2. Google Ngram plot of three 2-gram keyword searches for "potato disease", "potato rot" and "late blight" in publications from the English corpus, 1800-2019.

Figure 4. Global distribution map of 10 years (2012-2022) of Twitter mentions of late blight and related terms (tomato blight, potato blight, tuber rot, foliar blight, and variations of descriptions of Phytophthora infestans). (a) EurekaMag posts excluded and Retweets included. Inset shows that peak velocity of Tweets occurred in September 2020; (b) EurekaMag and Retweets excluded. Inset (bottom left) shows that peak volume of Tweets occurred in July 2013. The maps in Fig. 4 were created using the geopandas package (version 0.12.2, https://geopandas.org/en/stable/) in Python 3.10.9. The World Countries Generalized shapefile was obtained from ESRI (https://hub.arcgis.com/datasets/esri::world-countries-generalized/about) and used to create this map.

Figure 5. Late blight reports extracted and manually validated from Tweets (2012-2022) using supervised text classification and Named Entity Recognition with geocoding. (a) Number of times late blight was reported in posts globally, aggregated to the country scale. (b) The number of times late blight was reported in posts for the United States and Canada, aggregated to the state scale (shown in blue). Counts of location mentions at the city scale are aggregated to a 60 × 60 km grid (shown in red). Bar charts in both panels show the counts of omission (missed reports) and commission (false reports) errors identified during data validation for countries (a) and states (b) with the most errors. The maps in Fig. 5 were created using the geopandas package (version 0.12.2, https://geopandas.org/en/stable/) in Python 3.10.9. The World Countries Generalized shapefile was obtained from ESRI (https://hub.arcgis.com/datasets/esri::world-countries-generalized/about) and used for (a). The North America Political Boundaries shapefile from the Commission for Environmental Cooperation (CEC) (http://www.cec.org/north-american-environmental-atlas/political-boundaries-2021/) was used in (b).

https://doi.org/10.1038/s41598-024-52870-2

five state area (New York, Delaware, Massachusetts, New Jersey and Pennsylvania) to the rest of the northeastern US and Canada.